Ballista shuffle is finally working as intended, providing scalable distributed joins #750

Merged: 11 commits, Jul 21, 2021

@andygrove andygrove commented Jul 18, 2021

Which issue does this PR close?

Builds on #738. Closes #707.

With this PR we finally have scalable distributed joins.

Query 12 performance at SF=100:

| Executor concurrent tasks | Time (ms) |
|---------------------------|-----------|
| 2                         | 32592.94  |
| 4                         | 17865.03  |
| 8                         | 11641.62  |
| 16                        | 9296.60   |

+------------+-----------------+----------------+
| l_shipmode | high_line_count | low_line_count |
+------------+-----------------+----------------+
| MAIL       | 623097          | 934694         |
| SHIP       | 622959          | 934510         |
+------------+-----------------+----------------+
Query 12 avg time: 9896.37 ms
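
The scaling in the timing figures above can be quantified with a little arithmetic. The following is a minimal sketch, not part of the PR: the function names `speedups` and `efficiencies` are made up for illustration, and the 2-task run is assumed to be the baseline. It computes the speedup of each run relative to the baseline and the parallel efficiency (observed speedup divided by the increase in concurrent tasks).

```rust
/// Speedup of each run relative to the first (baseline) run.
fn speedups(times_ms: &[f64]) -> Vec<f64> {
    match times_ms.first() {
        Some(&base) => times_ms.iter().map(|t| base / t).collect(),
        None => Vec::new(),
    }
}

/// Parallel efficiency: observed speedup divided by the ideal speedup,
/// where the ideal speedup is (tasks / baseline tasks).
fn efficiencies(tasks: &[u32], times_ms: &[f64]) -> Vec<f64> {
    let base_tasks = match tasks.first() {
        Some(&t) => t as f64,
        None => return Vec::new(),
    };
    speedups(times_ms)
        .iter()
        .zip(tasks)
        .map(|(s, &t)| s / (t as f64 / base_tasks))
        .collect()
}

fn main() {
    // Figures copied from the PR description (TPC-H query 12, SF=100).
    let tasks = [2u32, 4, 8, 16];
    let times_ms = [32592.94, 17865.03, 11641.62, 9296.60];

    let s = speedups(&times_ms);
    let e = efficiencies(&tasks, &times_ms);
    for i in 0..tasks.len() {
        println!(
            "{:>2} tasks: speedup {:.2}x, efficiency {:.0}%",
            tasks[i],
            s[i],
            e[i] * 100.0
        );
    }
}
```

On these numbers, going from 2 to 16 concurrent tasks (8x the concurrency) gives roughly a 3.5x speedup, so the query scales but with diminishing returns at higher concurrency.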

Integration tests pass.

Rationale for this change

This is making Ballista work as it was intended to work.

What changes are included in this PR?

Tons of bug fixes around shuffles.

Are there any user-facing changes?

No

@andygrove andygrove marked this pull request as ready for review July 20, 2021 04:15
@andygrove (Member Author) commented:

@houqp @Dandandan @edrevo @alamb @jorgecarleitao Ballista finally supports scalable distributed joins, at least for TPC-H. I plan to follow up with some smaller code cleanup PRs now that the functionality is working.

@alamb alamb left a comment

I reviewed the code. While I am not a Ballista expert, it seems reasonable to me.

One thing I did notice was that there don't appear to be any new / updated tests in this PR.

    @@ -69,7 +78,8 @@ impl ExecutionPlan for UnresolvedShuffleExec {
        }

        fn output_partitioning(&self) -> Partitioning {
            Partitioning::UnknownPartitioning(self.partition_count)
            //TODO the output partition is known and should be populated here!

is this something that you want to finish up in this PR?

@andygrove (Member Author) replied:

I've filed https://github.com/apache/arrow-datafusion/issues/758 as a follow-up for implementing this since it involves more serde work.
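
For readers unfamiliar with the TODO being discussed, here is a rough sketch of the idea, under loud assumptions: the types below are simplified stand-ins written for this illustration, not the DataFusion or Ballista definitions. In particular, the `Hash` variant here holds column names rather than physical expressions, and the `known_partitioning` field is hypothetical. The sketch shows an `output_partitioning` that reports the producing stage's partitioning when it is known and falls back to `UnknownPartitioning` otherwise.

```rust
/// Simplified stand-in for a partitioning description (assumption: the real
/// enum's `Hash` variant carries physical expressions, not column names).
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    UnknownPartitioning(usize),
    Hash(Vec<String>, usize),
}

/// Stand-in for a placeholder plan node representing a shuffle whose
/// producing stage has not been resolved yet.
struct UnresolvedShuffleExec {
    partition_count: usize,
    /// Hypothetical field: the partitioning the producing stage writes, if known.
    known_partitioning: Option<Partitioning>,
}

impl UnresolvedShuffleExec {
    /// Report the known partitioning when available; otherwise only the count.
    fn output_partitioning(&self) -> Partitioning {
        match &self.known_partitioning {
            Some(p) => p.clone(),
            None => Partitioning::UnknownPartitioning(self.partition_count),
        }
    }
}

fn main() {
    let unknown = UnresolvedShuffleExec {
        partition_count: 8,
        known_partitioning: None,
    };
    let hashed = UnresolvedShuffleExec {
        partition_count: 8,
        known_partitioning: Some(Partitioning::Hash(vec!["l_orderkey".to_string()], 8)),
    };
    println!("{:?}", unknown.output_partitioning());
    println!("{:?}", hashed.output_partitioning());
}
```

Carrying the full partitioning on the placeholder node means it has to survive serialization between scheduler and executors, which is consistent with the note above that the follow-up involves more serde work.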

@andygrove (Member Author) commented:

> One thing I did notice was that there don't appear to be any new / updated tests in this PR.

I've added an additional test to check that TPC-H query 12 gets planned with correct partitioning information in the shuffle readers.

@andygrove andygrove merged commit ed5746d into apache:master Jul 21, 2021
@andygrove andygrove deleted the ballista-shuffle-working branch July 21, 2021 00:25
alamb commented Jul 21, 2021

🎉

@houqp houqp added the enhancement New feature or request label Jul 29, 2021
Linked issue: Ballista: Finish implementing shuffle mechanism